Entity Resolution in a Big Data Framework
Abstract
Resource Description Framework (RDF) is a data model that can be used to publish semistructured data visualized as directed graphs. An example is Dataset 1 in Fig. 1. Nodes in the graph represent entities and edges represent properties connecting these entities. Two nodes may refer to the same logical entity despite being syntactically disparate. For example, the entity Mickey Beats in Dataset 1 is represented by two syntactically different nodes. Entity Resolution (ER) is the problem of resolving such semantically equivalent entities by linking them using a special sameAs property edge (Ferrara, Nikolov, and Scharffe 2013).

The ER problem is not restricted to the RDF data model but can be stated abstractly as identifying and resolving semantically equivalent entities in one or more datasets (Elmagarmid, Ipeirotis, and Verykios 2007). As an example of applying ER on relational datasets, consider Datasets 2 and 3 in Fig. 1. In these datasets, an entity is represented as a tuple. The goal is to identify duplicate tuples, that is, tuples referring to the same logical entity. ER is an important AI problem that has been acknowledged as occurring in structured, semistructured and even unstructured data models. A survey on the subject cites at least eight different names for the problem, including record linkage, instance matching, link discovery and co-reference resolution (Elmagarmid, Ipeirotis, and Verykios 2007).

The problem has grown in concert with the Semantic Web and with the publishing of new data on the Web. Given the prevalence of large datasets in data integration applications (Goodhue, Wybo, and Kirsch 1992), an ER solution must be scalable. For example, consider Linked Open Data (LOD), which is the collection of RDF datasets published under an open license (Bizer, Heath, and Berners-Lee 2009). LOD currently contains over 30 billion triples and over 500 million property edges, published in over 300 datasets.
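The duplicate-tuple goal described above can be illustrated with a minimal sketch. It is not the method of the paper: the token-based Jaccard similarity, the `threshold` value, the function names and the sample tuples are all assumptions chosen for illustration, and the records are hypothetical rather than taken from Fig. 1.

```python
import re
from itertools import combinations


def tokens(record):
    """Lowercased alphanumeric tokens drawn from every field of a tuple."""
    return {tok for field in record for tok in re.findall(r"\w+", str(field).lower())}


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def resolve(records, threshold=0.5):
    """Return index pairs whose token overlap meets the (assumed) threshold."""
    sets = [tokens(r) for r in records]
    return [
        (i, j)
        for i, j in combinations(range(len(records)), 2)
        if jaccard(sets[i], sets[j]) >= threshold
    ]


# Hypothetical tuples, not the contents of Fig. 1.
records = [
    ("Mickey Beats", "Los Angeles", "1985"),
    ("Beats, Mickey", "Los Angeles", "1985"),
    ("Joan Rivers", "New York", "1972"),
]

# Each returned pair plays the role of a sameAs link between two records.
same_as = resolve(records)
```

Comparing every pair of records takes a number of comparisons that grows quadratically with the dataset size, which is why scalability matters for datasets as large as those discussed here.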
Studies have suggested that LOD contains many syntactically disparate but semantically equivalent entities that have not yet been discovered and linked (Papadakis et al. 2010). In the relational domain, the Deep Web, which is the collection of back-end relational databases powering Web queries and faceted search, has also shown super-linear growth and is at
Similar resources
The Effect of Transitive Closure on the Calibration of Logistic Regression for Entity Resolution
This paper describes a series of experiments in using logistic regression machine learning as a method for entity resolution. From these experiments the authors concluded that when a supervised ML algorithm is trained to classify a pair of entity references as linked or not linked pair, the evaluation of the model’s performance should take into account the transitive closure of its pairwise lin...
Comparative Evaluation of Distributed Clustering Schemes for Multi-source Entity Resolution
Entity resolution identifies semantically equivalent entities, e.g., describing the same product or customer. It is especially challenging for big data applications where large volumes of data from many sources have to be matched and integrated. Entity resolution for multiple data sources is best addressed by clustering schemes that group all matching entities within clusters. While there are m...
Corpus based coreference resolution for Farsi text
"Coreference resolution" or "finding all expressions that refer to the same entity" in a text, is one of the important requirements in natural language processing. Two words are coreference when both refer to a single entity in the text or the real world. So the main task of coreference resolution systems is to identify terms that refer to a unique entity. A coreference resolution tool could be...
2016 Olympic Games on Twitter: Sentiment Analysis of Sports Fans Tweets using Big Data Framework
Big data analytics is one of the most important subjects in computer science. Today, due to the increasing expansion of Web technology, a large amount of data is available to researchers. Extracting information from these data is one of the requirements for many organizations and business centers. In recent years, the massive amount of Twitter's social networking data has become a platform for ...
Tutorial: Uncertain Entity Resolution
Entity resolution is a fundamental problem in data integration dealing with the combination of data from different sources to a unified view of the data. Entity resolution is inherently an uncertain process because the decision to map a set of records to the same entity cannot be made with certainty unless these are identical in all of their attributes or have a common key. In the light of rece...
CS 730 R : Topics in Data and Information Management – Big Data Analytics
The paper presents two concepts: entity resolution (ER, record linkage) and data privacy (DP). Authors presented a sketch of a framework for managing information leakage, and studied how the framework can be used to answer a variety of questions related to ER and DP. In the paper they studied the problems of measuring the incremental leakage of critical information. The framework bases on defin...
Publication date: 2015